ABSTRACT

Clipping, or saturation, is a very common problem in audio
signal processing for which, in the severe case, there is
still no satisfactory solution. In such a case, a tremendous
amount of information is lost, and traditional methods fail
to appropriately recover the signal. We propose a novel
approach to this signal restoration problem based on the
framework of Iterative Hard Thresholding. This approach,
which enforces the consistency of the reconstructed signal
with the clipped observations, shows superior performance in
comparison to state-of-the-art declipping algorithms. This
is confirmed on synthetic and on actual high-dimensional
audio data, both in terms of SNR and in subjective user
listening evaluations.

Index Terms: Signal Clipping, Sparse Recovery, Inverse
Problems, Greedy Methods, Audio Processing.

1. INTRODUCTION 

Signal clipping is the corruption of the dynamic range of a
signal, manifested as a corruption of the signal magnitude at
some boundary level. This phenomenon usually occurs during
the very first stages of signal recording, typically when the
input range of a device is not sufficiently large, as in A/D
converters, or when the response of a system is not linear
beyond a certain level. Quite naturally, the task of
"declipping" such a corrupted signal has therefore attracted a
lot of research recently in the signal processing community
and, in particular, in the audio restoration field, bearing in
mind the sensitivity of the human ear to unnatural sound
artifacts.

There are two main clipping models: hard and soft clipping.
The first, as illustrated in Fig. 1, simply replaces the
saturated signal amplitudes by some constant saturation level,
while the second, which is not treated in this paper,
corresponds to a reduction of the amplitude gain beyond this
level. Once a clipped audio signal is replayed, humans
perceive it as an unnatural and unpleasantly distorted sound.
In music, for instance, all notes of a clipped sound seem
loud, because both soft and loud notes are clipped to the
same level, which reduces the auditory contrast.

This paper presents a novel iterative method for signal
declipping. As explained in Sec. 2, this is based on the
premise that the underlying signal has a sparse structure in a
convenient representation. After introducing a model for the
clipping scenario in Sec. 3, Sec. 4 defines the declipping
operation as an inverse problem regularized by the sparsity
assumption and which stays consistent with the whole clipping
process. The main interest of this formulation is to provide
guidelines for developing a consistent variant of the
Iterative Hard Thresholding [1] adapted to the non-linear
clipping alteration. After a brief review of the literature in
the field (Sec. 5), the efficiency of this approach is finally
demonstrated on synthetic and actual audio signal restoration
in comparison with state-of-the-art methods (Sec. 6).

LJ and CDV are supported by the Belgian FRS-FNRS fund. NM, MPH
and AS are supported by NXP software, Leuven. Part of this work has been
funded by the SPORTIC project (WIST3), Walloon Region, Belgium.

Fig. 1: Hard-clipping example: $\mathbf{x} = x(t)$ is the original signal
(light curve) and $\mathbf{x}_c(t)$ is its clipped version (dark curve).
The notations are explained in Sec. 3.

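Conventions: Most domain dimensions (e.g., $M$, $N$) are
denoted by capital roman letters. Vectors and matrices are
associated to bold symbols while lowercase light letters are
associated to scalar values. The $i$-th component of a vector
$\mathbf{u}$ is $u_i$ or $(\mathbf{u})_i$. The identity matrix
is $\mathbf{I}$. Vectors of zeros and ones are denoted by
$\mathbf{0}$ and $\mathbf{1}$ respectively. The set of indices
in $\mathbb{R}^D$ is $[D] = \{1, \cdots, D\}$. The scalar
product between two vectors $\mathbf{u}, \mathbf{v} \in
\mathbb{R}^D$ reads $\mathbf{u}^T \mathbf{v} = \langle
\mathbf{u}, \mathbf{v} \rangle$ (using the transposition
$(\cdot)^T$). For any $p \geq 1$, $\|\cdot\|_p$ represents the
$\ell_p$-norm such that $\|\mathbf{u}\|_p^p = \sum_i |u_i|^p$,
with $\|\mathbf{u}\| = \|\mathbf{u}\|_2$ and
$\|\mathbf{u}\|_\infty = \max_i |u_i|$. The $\ell_0$ "norm" is
$\|\mathbf{u}\|_0 = \#\,\mathrm{supp}\,\mathbf{u}$, where $\#$
is the cardinality operator and $\mathrm{supp}\,\mathbf{u} =
\{i : u_i \neq 0\} \subset [D]$. For $S \subset [D]$,
$\mathbf{u}|_S \in \mathbb{R}^{\#S}$ (or
$\boldsymbol{\Phi}|_S$) denotes the vector (resp. the matrix)
obtained by retaining the components (resp. columns) of
$\mathbf{u} \in \mathbb{R}^D$ (resp. $\boldsymbol{\Phi} \in
\mathbb{R}^{D' \times D}$) belonging to $S \subset [D]$.
Alternatively, $\mathbf{u}|_S = \mathbf{R}_S \mathbf{u}$ or
$\boldsymbol{\Phi}|_S = \boldsymbol{\Phi} \mathbf{R}_S^T$,
where $\mathbf{R}_S := (\mathbf{I}|_S)^T \in \{0, 1\}^{\#S
\times D}$ is the restriction operator. We denote by
$(\mathbf{x})_+$ the positive thresholding
$((\mathbf{x})_+)_i = (x_i + |x_i|)/2$, while the negative
counterpart reads $(\mathbf{x})_- = -(-\mathbf{x})_+$.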



2. SPARSE SIGNAL REPRESENTATION 

The vast majority of real-life audio signals have compressible
structures, meaning that these signals may be represented or
approximated as the linear combination of a few elements taken
in a set of elementary waveforms (e.g., DCT, Wavelets [2, 3]).

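Mathematically, this sparsity concept is applied to 1-D
temporal signals (e.g., audio) as follows. We assume that,
within a certain time window $\mathcal{T} \subset \mathbb{R}$,
a continuous signal $x(t)$ has been sampled with $N$ regular
samples gathered in a column vector $\mathbf{x} \in
\mathbb{R}^N$. We then consider that, for an appropriate
sparsity basis $\boldsymbol{\Psi} \in \mathbb{R}^{N \times D}$
with $N \leq D$, $\mathbf{x}$ can be described as

$$\mathbf{x} \approx \boldsymbol{\Psi}\boldsymbol{\alpha}, \quad \text{with } \|\boldsymbol{\alpha}\|_0 \ll N \text{ and } \|\mathbf{x} - \boldsymbol{\Psi}\boldsymbol{\alpha}\| \ll \|\mathbf{x}\|. \qquad (1)$$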

When $\boldsymbol{\Psi}$ is an orthonormal basis with $D = N$,
there exists only one vector $\boldsymbol{\alpha}^*$
satisfying $\mathbf{x} = \boldsymbol{\Psi}\boldsymbol{\alpha}^*$.
A sparse coefficient vector answering the problem (1) is found
by taking the best $K$-term approximation of
$\boldsymbol{\alpha}^*$ given a fixed sparsity level $K \ll N$.
In other words, $\boldsymbol{\alpha} \approx
\boldsymbol{\alpha}_K := \mathcal{H}_K(\boldsymbol{\alpha}^*)$,
where $\mathcal{H}_K$ is the $K$-term thresholding operator
setting to zero all but the $K$ highest-magnitude coefficients
of $\boldsymbol{\alpha}^*$. If $D > N$, there are many
coefficient vectors whose re-synthesis with $\boldsymbol{\Psi}$
approximates $\mathbf{x}$. This redundancy is often useful to
select a sparser coefficient vector. Despite the NP-hardness
of (1) [4], "relaxed" optimization methods and greedy
algorithms exist in order to find such sparse vectors under
additional requirements on $\boldsymbol{\Psi}$ [5, 6].
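The $K$-term thresholding operator $\mathcal{H}_K$ described above can be sketched in a few lines. This is only an illustrative, dependency-free version; the function name `hard_threshold` and the tie-breaking rule (lowest index first) are assumptions, not part of the paper.

```python
def hard_threshold(alpha, K):
    """H_K: keep the K largest-magnitude entries of alpha, zero the others."""
    if K <= 0:
        return [0.0] * len(alpha)
    # Indices sorted by decreasing magnitude; ties go to the lowest index
    # (an arbitrary choice made for this sketch).
    order = sorted(range(len(alpha)), key=lambda i: (-abs(alpha[i]), i))
    keep = set(order[:K])
    return [a if i in keep else 0.0 for i, a in enumerate(alpha)]


alpha_star = [0.1, -3.0, 0.7, 2.5, -0.2]
alpha_K = hard_threshold(alpha_star, 2)   # best 2-term approximation
print(alpha_K)
```

Sorting costs $O(D \log D)$; a partial selection would be cheaper for large dictionaries, but the sorted version keeps the sketch short.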

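Amongst them, the Iterative Hard Thresholding (IHT) offers
interesting advantages like fast convergence and provable
sparse decomposition guarantees [1]. If a signal $\mathbf{x}$
is assumed $K$-sparse in $\boldsymbol{\Psi} \in \mathbb{R}^{N
\times D}$ (with $K \ll N$), this algorithm is designed to
find one minimizer of a Lasso-type [7] restatement of (1):

$$\hat{\boldsymbol{\alpha}} = \arg\min_{\boldsymbol{\alpha}} \tfrac{1}{2}\|\mathbf{x} - \boldsymbol{\Psi}\boldsymbol{\alpha}\|^2 \quad \text{s.t. } \|\boldsymbol{\alpha}\|_0 \leq K. \qquad (2)$$

IHT approximates the solution of (2) by performing the
following iterative evaluation:

$$\boldsymbol{\alpha}^{(n+1)} = \mathcal{H}_K\big[\boldsymbol{\alpha}^{(n)} + \boldsymbol{\Psi}^T(\mathbf{x} - \boldsymbol{\Psi}\boldsymbol{\alpha}^{(n)})\big], \quad \boldsymbol{\alpha}^{(0)} = \mathbf{0}. \qquad (3)$$

In words, at each iteration, this algorithm hard-thresholds the
previous solution updated by a gradient descent step on the
fidelity cost of (2).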
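Iteration (3) can be illustrated on a toy problem. The sketch below assumes an orthonormal DCT-II dictionary (so $D = N$) stored as plain Python lists; the helper names (`dct_basis`, `synth`, `analyze`, `iht`) and the fixed iteration count are illustrative choices, not the authors' implementation.

```python
import math


def dct_basis(N):
    """Orthonormal DCT-II dictionary Psi (N rows, N columns); column k is atom k."""
    return [[math.sqrt((1.0 if k == 0 else 2.0) / N)
             * math.cos(math.pi * (n + 0.5) * k / N) for k in range(N)]
            for n in range(N)]


def synth(Psi, alpha):
    """Synthesis x = Psi @ alpha."""
    return [sum(p * a for p, a in zip(row, alpha)) for row in Psi]


def analyze(Psi, r):
    """Analysis Psi^T @ r."""
    return [sum(row[k] * rn for row, rn in zip(Psi, r))
            for k in range(len(Psi[0]))]


def hard_threshold(alpha, K):
    """H_K: keep the K largest-magnitude entries, zero the others."""
    order = sorted(range(len(alpha)), key=lambda i: (-abs(alpha[i]), i))
    keep = set(order[:K])
    return [a if i in keep else 0.0 for i, a in enumerate(alpha)]


def iht(x, Psi, K, n_iter=20):
    """Eq. (3): alpha <- H_K[alpha + Psi^T (x - Psi alpha)], starting from 0."""
    alpha = [0.0] * len(Psi[0])
    for _ in range(n_iter):
        residual = [xi - si for xi, si in zip(x, synth(Psi, alpha))]
        step = analyze(Psi, residual)
        alpha = hard_threshold([a + g for a, g in zip(alpha, step)], K)
    return alpha


N, K = 32, 3
Psi = dct_basis(N)
alpha_true = [0.0] * N
alpha_true[2], alpha_true[5], alpha_true[11] = 1.0, -0.8, 0.5
x = synth(Psi, alpha_true)
alpha_hat = iht(x, Psi, K)
```

With an orthonormal dictionary and a fully observed, exactly $K$-sparse signal, the first iteration already returns $\mathcal{H}_K(\boldsymbol{\Psi}^T\mathbf{x})$, so this example is only meant to show the mechanics of the update.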

The IHT procedure is very general and is also applicable to
the recovery of signals indirectly observed by a sensing
matrix $\boldsymbol{\Phi} \in \mathbb{R}^{M \times N}$, as in
the Compressed Sensing (CS) framework [8, 9]. In such a case,
the algorithm above simply undergoes the replacement
$\boldsymbol{\Psi} \to \boldsymbol{\Phi}\boldsymbol{\Psi} \in
\mathbb{R}^{M \times D}$ for integrating this sensing.

3. CLIPPING MODEL 

Assuming a symmetric clipping (as in Fig. 1) associated to a
clipping threshold $\tau > 0$, the (hard) clipping operation
$\mathcal{C}_\tau$ is mathematically defined as:

$$\mathbf{x}_c = \mathcal{C}_\tau(\mathbf{x}) := \min(|\mathbf{x}|, \tau)\,\mathrm{sign}(\mathbf{x}), \qquad (4)$$

where all the operations are applied componentwise on $\mathbf{x}$.
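As a small illustration of the hard-clipping operator (4), the following sketch applies it componentwise to a list of samples. The function name `clip` and the example values are assumptions made for illustration.

```python
import math


def clip(x, tau):
    """C_tau(x) = min(|x|, tau) * sign(x), applied componentwise (Eq. (4))."""
    return [math.copysign(min(abs(v), tau), v) for v in x]


x = [0.2, -1.3, 0.9, 2.0, -0.5]
x_c = clip(x, tau=1.0)
print(x_c)
```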

From this clipping operation, we can define different sets of
samples in $\mathbf{x}$. The set of reliable data, those which
are not subject to clipping, is $\Omega_r = \{i \in [N] :
|x_i| < \tau\}$, while the clipped index set $\Omega_c = \{i
\in [N] : |x_i| \geq \tau\}$ can be split into two disjoint
subsets $\Omega_c^{\pm} = \{i \in [N] : \pm x_i \geq \tau\}$,
with $\Omega_c = \Omega_c^+ \cup \Omega_c^-$ and $\Omega_r
\cup \Omega_c = [N]$.

In this work, we assume that these sets are known. This
happens for instance if the clipping process is hard and not
corrupted by strong noise, in which case all the previous
sets can be deduced from the observation of $\mathbf{x}_c$.

The knowledge of the sets $\Omega_r$ and $\Omega_c^{\pm}$
simplifies the corresponding forward model (4), i.e.,

$$\mathbf{x}_c = \mathbf{M}_{\Omega_r}\mathbf{x} + \tau\,\mathbf{M}_{\Omega_c^+}\mathbf{1} - \tau\,\mathbf{M}_{\Omega_c^-}\mathbf{1}, \qquad (5)$$

where $\mathbf{M}_S = \mathbf{R}_S^T\mathbf{R}_S \in
\mathbb{R}^{N \times N}$ is a diagonal masking matrix, i.e.,
$(\mathbf{M}_S\mathbf{u})_i = u_i$ if $i \in S \subset [N]$
and 0 otherwise.
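The index sets and the masked forward model (5) can be checked numerically on a toy vector. The sketch below assumes a noiseless hard-clipped observation; the function names (`clipping_sets`, `mask`, `forward_model`) are hypothetical.

```python
def clipping_sets(x_c, tau):
    """Deduce (Omega_r, Omega_c^+, Omega_c^-) from a noiseless clipped x_c."""
    reliable = [i for i, v in enumerate(x_c) if abs(v) < tau]
    clipped_pos = [i for i, v in enumerate(x_c) if v >= tau]
    clipped_neg = [i for i, v in enumerate(x_c) if v <= -tau]
    return reliable, clipped_pos, clipped_neg


def mask(u, S):
    """M_S u: keep the entries indexed by S and zero the others."""
    S = set(S)
    return [ui if i in S else 0.0 for i, ui in enumerate(u)]


def forward_model(x, tau, reliable, clipped_pos, clipped_neg):
    """Eq. (5): x_c = M_{Omega_r} x + tau M_{Omega_c^+} 1 - tau M_{Omega_c^-} 1."""
    ones = [1.0] * len(x)
    a = mask(x, reliable)
    b = mask(ones, clipped_pos)
    c = mask(ones, clipped_neg)
    return [ai + tau * bi - tau * ci for ai, bi, ci in zip(a, b, c)]


tau = 1.0
x = [0.2, -1.3, 0.9, 2.0, -0.5]            # original signal
x_c = [0.2, -1.0, 0.9, 1.0, -0.5]          # C_tau(x), written out by hand
sets = clipping_sets(x_c, tau)
print(sets)
print(forward_model(x, tau, *sets) == x_c)
```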

4. DECLIPPING INVERSE PROBLEM 

A naive approach to the declipping problem is to directly use
the IHT algorithm (3) on the partial observation of the
unclipped signal samples, i.e., considering
$\mathbf{x}_c|_{\Omega_r} = \mathbf{R}_{\Omega_r}\mathbf{x}$ as
the partial observations of $\mathbf{x}$ realized by the
sensing matrix $\boldsymbol{\Phi} = \mathbf{R}_{\Omega_r}$. As
explained in [10, 11], the downfall of this method is that it
does not take into account the information contained in the
clipped samples, namely the clipping threshold ($\tau$) and
(possibly) the uppermost absolute magnitude ($\theta$).

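Therefore, in this work, inspired by the ideal objective (2)
targeted by IHT, we address the following inverse problem

$$\hat{\boldsymbol{\alpha}} = \arg\min_{\boldsymbol{\alpha}} \tfrac{1}{2}\|\mathcal{B}(\mathbf{x}_c - \boldsymbol{\Psi}\boldsymbol{\alpha})\|^2 \quad \text{s.t. } \|\boldsymbol{\alpha}\|_0 \leq K, \qquad (6)$$

where the corresponding reconstructed signal is
$\hat{\mathbf{x}} = \boldsymbol{\Psi}\hat{\boldsymbol{\alpha}}$.
The key function $\mathcal{B} : \mathbb{R}^N \to \mathbb{R}^N$
involved in (6) reads

$$\mathcal{B}(\mathbf{u}) = \mathbf{M}_{\Omega_r}\mathbf{u} + (\mathbf{M}_{\Omega_c^+}\mathbf{u})_+ + (\mathbf{M}_{\Omega_c^-}\mathbf{u})_-. \qquad (7)$$

Minimizing $\mathcal{E}(\mathbf{x}) :=
\tfrac{1}{2}\|\mathcal{B}(\mathbf{x}_c - \mathbf{x})\|^2$
forces a signal candidate $\mathbf{x} =
\boldsymbol{\Psi}\boldsymbol{\alpha}$ to be consistent with the
observed clipped signal $\mathbf{x}_c$. Indeed, from (7), this
cost can be split in three parts, i.e.,

$$\mathcal{E}(\mathbf{x}) := \tfrac{1}{2}\|\mathbf{M}_{\Omega_r}(\mathbf{x}_c - \mathbf{x})\|^2 + \tfrac{1}{2}\|\mathbf{M}_{\Omega_c^+}(\tau\mathbf{1} - \mathbf{x})_+\|^2 + \tfrac{1}{2}\|\mathbf{M}_{\Omega_c^-}(-\tau\mathbf{1} - \mathbf{x})_-\|^2. \qquad (8)$$

A small first term promotes the candidate signal to match
the observed clipped signal in the unclipped domain, while,
thanks to the $+$ or $-$ thresholding functions, having a
minimal second (or third) term enforces $\mathbf{x}$ to be
bigger (resp. smaller) than $\tau$ (resp. $-\tau$) on the set
$\Omega_c^+$ (resp. $\Omega_c^-$).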
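To make the role of $\mathcal{B}$ concrete, the sketch below evaluates the consistency cost of (7) and (8) for two candidate signals: one that exceeds $\pm\tau$ on the clipped samples and one that does not. Function names and numerical values are illustrative assumptions.

```python
def B(u, reliable, clipped_pos, clipped_neg):
    """Eq. (7): identity on Omega_r, (.)_+ on Omega_c^+, (.)_- on Omega_c^-."""
    out = [0.0] * len(u)
    for i in reliable:
        out[i] = u[i]
    for i in clipped_pos:
        out[i] = max(u[i], 0.0)     # positive part
    for i in clipped_neg:
        out[i] = min(u[i], 0.0)     # negative part
    return out


def consistency_cost(x, x_c, reliable, clipped_pos, clipped_neg):
    """E(x) = 0.5 * ||B(x_c - x)||^2, the cost split in Eq. (8)."""
    r = B([c - v for c, v in zip(x_c, x)], reliable, clipped_pos, clipped_neg)
    return 0.5 * sum(ri * ri for ri in r)


tau = 1.0
x_c = [0.2, -1.0, 0.9, 1.0, -0.5]
sets = ([0, 2, 4], [3], [1])                 # Omega_r, Omega_c^+, Omega_c^-

consistent = [0.2, -1.3, 0.9, 2.0, -0.5]     # beyond +-tau on clipped samples
inconsistent = [0.2, -0.6, 0.9, 0.7, -0.5]   # inside (-tau, tau) on clipped samples

print(consistency_cost(consistent, x_c, *sets))
print(consistency_cost(inconsistent, x_c, *sets))
```

Any signal that matches $\mathbf{x}_c$ on $\Omega_r$ and lies beyond the threshold on $\Omega_c^{\pm}$ has zero cost, which is why the sparsity constraint in (6) is needed to single out one candidate.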

A few remarks can be made on (6). First, this declipping
program corresponds intuitively to picking, from the set of
all signals consistent with the clipped observation
$\mathbf{x}_c$, one whose sparsity is at most $K$. Provided the
original signal $\mathbf{x}$ (approximately) respects this
sparsity requirement, the declipping program will have a
solution.

Second, we actually impose a strict sparsity model on
$\boldsymbol{\alpha}$, parametrized by a sparsity order $K \ll
N$ assumed optimal. We will see later how we can estimate this
value. Notice also that additional constraints can be added,
such as the knowledge that the original signal has a bounded
amplitude (see Fig. 1), i.e., we can additionally impose
$\|\boldsymbol{\Psi}\boldsymbol{\alpha}\|_\infty \leq \theta$
(e.g., as done in [10, 11]).

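Of course, directly solving (6) is as hard as recovering a
sparse signal in (2). However, a novel iterative hard
thresholding adjusted to (6) can be obtained by following the
same guidelines as those used to derive IHT from (2) [1].
Since $(\cdot)_{\pm}^2$ is differentiable, the cost
$\mathcal{E}(\boldsymbol{\Psi}\boldsymbol{\beta})$ is actually
a smooth convex function of $\boldsymbol{\beta} \in
\mathbb{R}^D$ whose gradient reads:

$$\nabla_{\boldsymbol{\beta}}\, \mathcal{E}(\boldsymbol{\Psi}\boldsymbol{\beta}) = -\boldsymbol{\Psi}^T \mathcal{B}(\mathbf{x}_c - \boldsymbol{\Psi}\boldsymbol{\beta}).$$

Consequently, remembering that the internal part of the
thresholding in (3) is a gradient descent, we propose to solve
(6) with a version of Iterative Hard Thresholding adjusted to
DeClipping that we call IHT-DC:

$$\boldsymbol{\alpha}^{(n+1)} = \mathcal{H}_K\big[\boldsymbol{\alpha}^{(n)} + \mu^{(n)} \boldsymbol{\Psi}^T \mathcal{B}(\mathbf{x}_c - \boldsymbol{\Psi}\boldsymbol{\alpha}^{(n)})\big], \qquad (9)$$

where $\boldsymbol{\alpha}^{(0)} = \mathbf{0}$. The value
$\mu^{(n)}$ is simply selected by a fast 1-D convex
minimization (e.g., using golden section [13]) of the cost
$\mathcal{E}$ along this update direction.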
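The IHT-DC update (9) can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' Matlab implementation: it assumes an orthonormal DCT-II dictionary, reads the index sets off $\mathbf{x}_c$ and $\tau$, and selects the step by a golden-section search of the cost along the gradient direction over a hypothetical bracket `[0, mu_max]`. All function names and parameter values are assumptions.

```python
import math


def dct_basis(N):
    """Orthonormal DCT-II dictionary Psi as a list of N rows of length N."""
    return [[math.sqrt((1.0 if k == 0 else 2.0) / N)
             * math.cos(math.pi * (n + 0.5) * k / N) for k in range(N)]
            for n in range(N)]


def synth(Psi, alpha):                      # Psi @ alpha
    return [sum(p * a for p, a in zip(row, alpha)) for row in Psi]


def analyze(Psi, r):                        # Psi^T @ r
    return [sum(row[k] * rn for row, rn in zip(Psi, r))
            for k in range(len(Psi[0]))]


def hard_threshold(alpha, K):               # H_K
    order = sorted(range(len(alpha)), key=lambda i: (-abs(alpha[i]), i))
    keep = set(order[:K])
    return [a if i in keep else 0.0 for i, a in enumerate(alpha)]


def clip(x, tau):                           # C_tau, Eq. (4)
    return [math.copysign(min(abs(v), tau), v) for v in x]


def b_residual(x_c, x, tau):
    """B(x_c - x) of Eq. (7); the index sets are read off x_c and tau."""
    out = []
    for c, v in zip(x_c, x):
        u = c - v
        if c >= tau:
            u = max(u, 0.0)                 # Omega_c^+: only x < tau is penalized
        elif c <= -tau:
            u = min(u, 0.0)                 # Omega_c^-: only x > -tau is penalized
        out.append(u)
    return out


def cost(alpha, Psi, x_c, tau):             # E(Psi alpha), Eq. (8)
    return 0.5 * sum(v * v for v in b_residual(x_c, synth(Psi, alpha), tau))


def golden_section(f, lo, hi, n_iter=40):
    """Golden-section search for the minimizer of a unimodal f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(n_iter):
        if fc < fd:                         # minimizer lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                               # minimizer lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)


def iht_dc_step(alpha, Psi, x_c, tau, K, mu_max=10.0):
    """One update of Eq. (9); mu_max is a hypothetical search bracket."""
    g = analyze(Psi, b_residual(x_c, synth(Psi, alpha), tau))
    if not any(g):
        return list(alpha)                  # zero gradient: already consistent

    def moved(mu):
        return [a + mu * gi for a, gi in zip(alpha, g)]

    mu = golden_section(lambda m: cost(moved(m), Psi, x_c, tau), 0.0, mu_max)
    return hard_threshold(moved(mu), K)


def iht_dc(x_c, Psi, tau, K, n_iter=30):
    alpha = [0.0] * len(Psi[0])             # alpha^(0) = 0
    for _ in range(n_iter):
        alpha = iht_dc_step(alpha, Psi, x_c, tau, K)
    return alpha


N, K, tau = 32, 2, 0.2
Psi = dct_basis(N)
alpha_true = [0.0] * N
alpha_true[3], alpha_true[7] = 1.0, 0.6
x_c = clip(synth(Psi, alpha_true), tau)
alpha_hat = iht_dc(x_c, Psi, tau, K)
print(cost(alpha_hat, Psi, x_c, tau))
```

One property worth noting in this sketch: a $K$-sparse coefficient vector whose synthesis is consistent with $\mathbf{x}_c$ has zero gradient, so it is a fixed point of the update.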

5. PRIOR WORKS IN THE FIELD 

Different strategies have been developed so far in the
literature to address the declipping problem. One of the
oldest is the Autoregressive (AR) method [14]. AR assumes that
the underlying signal can be modeled as an autoregressive
process [15], and based on that premise it estimates the AR
coefficients and interpolates the missing samples. AR thus
relies on one specific generative signal model which is well
adapted to speech signals only. It lacks flexibility for
modeling and restoring other kinds of signals (such as music),
an issue solved by sparsity-driven methods.

A more recent approach is the Constrained Orthogonal
Matching Pursuit (cOMP) audio inpainting algorithm [10].
This one is fundamentally based on Orthogonal Matching
Pursuit [16] and constrained optimization. In the first stage,
cOMP discards non-reliable samples from the data and attempts
to detect the optimal basis vectors using only reliable
samples. In the second stage, clipping constraints are imposed
on the chosen basis using some external optimization toolbox.
Despite reported improvements relative to the AR declipping
results, one drawback of this algorithm is that the first
stage does not take into account the information stored in
the clipped samples. This may lead to an incorrect sparse
support estimation impacting the whole method. A second
drawback is that the number of iterations is directly related
to the length of the estimated sparsity support, due to the
fact that OMP gradually increases the support length at each
iteration.



The work in [11] solves a variant of (6) where the
sparsity-inducing $\ell_0$ "norm" is minimized and the clipping
consistency is a constraint. This technique uses a reweighted
$\ell_1$ minimization for approaching $\ell_0$ [12] and it
addresses the shortfall of cOMP. Unfortunately, we were unable
to extensively run the corresponding toolbox in our
high-dimensional audio setting (see Sec. 6). There also exist
commercial declipping software packages such as the Adobe®
Audition DeClipper. However, their black-box nature makes any
theoretical comparison with other methods difficult.

Finally, although not directly related, clipping is associated
to saturation in compressive data quantization [19] or to
time-frequency signal corruption [20], while in the extreme
case where $\tau \to 0^+$, the cost $\mathcal{E}$ in (8) is
reminiscent of the energy implicitly minimized in the Binary
Iterative Hard Thresholding (BIHT) in the context of 1-bit CS
[21].

6. EXPERIMENTS 




Fig. 2: Probability of success for accurate recovery, as a function of the
sparsity level K, under severe (5 dB), moderate (10 dB) and mild (20 dB)
clipping.

Two sets of benchmark tests have been conducted, one with
synthetic signals and the other with actual audio (music)
data. For both data types, signals are degraded by hard
clipping. The performance of the algorithms is measured in dB
through input and output (restored signal) SNRs. These metrics
are defined as
$\mathrm{iSNR} = 10 \log_{10}(\|\mathbf{x}\|^2 / \|\mathbf{x} - \mathbf{x}_c\|^2)$
and
$\mathrm{oSNR} = 10 \log_{10}(\|\mathbf{x}\|^2 / \|\mathbf{x} - \hat{\mathbf{x}}\|^2)$,
respectively, where $\mathbf{x}$, $\mathbf{x}_c$ and
$\hat{\mathbf{x}}$ are the original, the clipped and the
declipped signals, respectively.
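The two SNR metrics share one formula, sketched below as `snr_db`. The paper does not state how the clipping threshold is chosen to reach a given iSNR; the bisection helper `threshold_for_isnr` is a hypothetical way to do so, relying on the fact that the iSNR increases with $\tau$.

```python
import math


def clip(x, tau):
    """Hard clipping C_tau of Eq. (4)."""
    return [math.copysign(min(abs(v), tau), v) for v in x]


def snr_db(reference, estimate):
    """10 log10(||reference||^2 / ||reference - estimate||^2), in dB."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return math.inf if error == 0.0 else 10.0 * math.log10(signal / error)


def threshold_for_isnr(x, target_db, n_iter=60):
    """Bisection on tau so that iSNR(x, C_tau(x)) is close to target_db.

    Hypothetical helper: assumes target_db > 0 and x not identically zero.
    """
    lo, hi = 0.0, max(abs(v) for v in x)
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        if snr_db(x, clip(x, tau)) < target_db:
            lo = tau                  # too much clipping: raise the threshold
        else:
            hi = tau
    return 0.5 * (lo + hi)


x = [0.2, -1.3, 0.9, 2.0, -0.5]       # original
x_c = clip(x, 1.0)                    # clipped at tau = 1
i_snr = snr_db(x, x_c)                # iSNR; oSNR uses the declipped signal instead
print(round(i_snr, 2))
```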

Benchmark on synthetic data: This type of benchmark
is suitable for sparsity-based declipping algorithms such as
cOMP and IHT-DC. Random $K$-sparse synthetic signals have been
generated in $\mathbb{R}^N$, with $N = 1024$, in a DCT basis
$\boldsymbol{\Psi} \in \mathbb{R}^{N \times N}$ for various
values of $1 \leq K \leq N$. These sparse signals were defined
by first selecting uniformly at random a $K$-length support
$T$ in $[D]$ and by drawing the dictionary coefficients on
this support independently and identically according to a
normal distribution, the coefficients outside of $T$ being set
to 0.
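The synthetic signal generation described above could look like the following sketch. The orthonormal DCT-II construction, the standard normal distribution (zero mean, unit variance), the reduced dimension and the seed are assumptions chosen for illustration; the paper uses $N = 1024$.

```python
import math
import random


def dct_basis(N):
    """Orthonormal DCT-II dictionary Psi as a list of N rows of length N."""
    return [[math.sqrt((1.0 if k == 0 else 2.0) / N)
             * math.cos(math.pi * (n + 0.5) * k / N) for k in range(N)]
            for n in range(N)]


def random_sparse_signal(Psi, K, rng):
    """Draw a K-sparse coefficient vector and synthesize x = Psi @ alpha."""
    D = len(Psi[0])
    support = rng.sample(range(D), K)          # uniform K-element support in [D]
    alpha = [0.0] * D
    for k in support:
        alpha[k] = rng.gauss(0.0, 1.0)         # i.i.d. normal coefficients
    x = [sum(p * a for p, a in zip(row, alpha)) for row in Psi]
    return alpha, x


rng = random.Random(0)                         # fixed seed for reproducibility
Psi = dct_basis(64)
alpha, x = random_sparse_signal(Psi, K=5, rng=rng)
print(sum(1 for a in alpha if a != 0.0))
```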

For both reconstruction methods, the "success" of a
declipped reconstruction has been arbitrarily defined as
oSNR > 80 dB. For different values of $K$ and for different
clipping scenarios, we estimated the probability of accurate
recovery by the frequency of success over 100 trials. The
results for 3 levels of clipping are presented in Fig. 2.
These are defined as "mild" (iSNR = 20 dB), "moderate"
(iSNR = 10 dB) and "severe" (iSNR = 5 dB). It appears that
the proposed method outperforms cOMP in all cases. The
advantage of using IHT-DC is especially pronounced for
severe and moderate clipping, while it becomes smaller for
the mild clipping level. This is in accordance with the fact
that, at this level, more reliable observations are available
and cOMP manages to detect the correct basis vectors more
often. Interestingly, the global appearance of Fig. 2 is
similar to the transition recovery diagram existing in sparse
signal recovery in Compressed Sensing [17], the $y$-axis in
this figure measuring implicitly the amount of reliable
observations.

Regarding the processing time, our Matlab® implementation of
IHT-DC is fast when the sparsity level is smaller than the
success/failure transition point reported in Fig. 2. For
instance, under moderate clipping and for $K \leq 512$ (the
transition occurring in $K \in [576, 768]$), the recovery is
perfect and IHT-DC stops after fewer than 100 iterations,
i.e., less than 2 s on a 2.93 GHz laptop.

Benchmark on audio data: In this benchmark, we test the
efficiency of our approach on two actual music excerpts,
i.e., "You Are My Kind" by Seal & Santana (15 s) and
"Sultans of Swing" by Dire Straits (10 s), each one having a
very different speech and tonal content. Each track is sampled
at 16 kHz (32 bits per sample), and the declipping operation
was realized by individually processing 75%-overlapping
temporal slices of length $N = 1024$, before re-synthesizing
the full-length music signal with a symmetric sinusoidal
window [10].
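The frame-based processing can be sketched as an overlap-add loop. The exact windowing scheme of [10] is not reproduced here: this version assumes a sine synthesis window, a hop of $N/4$ (75% overlap), and a normalization by the accumulated window weights, so that an identity `process` returns the input unchanged. The `process` callable stands in for a per-frame declipper.

```python
import math


def sine_window(N):
    """Symmetric sinusoidal window, strictly positive on all N samples."""
    return [math.sin(math.pi * (n + 0.5) / N) for n in range(N)]


def process_overlap_add(signal, N, process):
    """Apply `process` to 75%-overlapping length-N frames, then overlap-add."""
    hop = N // 4
    w = sine_window(N)
    out = [0.0] * len(signal)
    weight = [0.0] * len(signal)
    for start in range(0, len(signal) - N + 1, hop):
        frame = process(signal[start:start + N])
        for n in range(N):
            out[start + n] += w[n] * frame[n]
            weight[start + n] += w[n]
    # Samples not covered by any frame (a possible short tail) are left at 0.
    return [o / s if s > 0.0 else 0.0 for o, s in zip(out, weight)]


signal = [math.sin(0.3 * n) for n in range(64)]
restored = process_overlap_add(signal, 16, lambda frame: frame)
print(max(abs(a - b) for a, b in zip(signal, restored)))
```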

Since the audio signals are not truly sparse, IHT-DC cannot
be readily applied to restore them in a clipping scenario.
To avoid arbitrarily setting one sparsity level $K$, e.g.,
according to the class of audio signals considered, we prefer
to make IHT-DC adaptive by forcing it to "learn" an optimal
sparsity level $K$. Selecting $\boldsymbol{\Psi}$ as a
two-times redundant DCT dictionary for efficient audio signal
representations [2], this adaptivity is achieved by gradually
increasing the sparsity requirement by 1 at each iteration,
starting from a low value, until the residual between the
clipped observations and the reclipped reconstructed signal
has a sufficiently small energy.
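The adaptive schedule can be sketched as below. Several simplifications are assumptions of this illustration and differ from the setting described above: an orthonormal DCT-II dictionary replaces the two-times redundant one, a fixed unit step `mu` replaces the golden-section search, and the stopping rule is a hypothetical relative tolerance `rel_tol` on the reclipped residual energy.

```python
import math


def dct_basis(N):
    """Orthonormal DCT-II dictionary Psi as a list of N rows of length N."""
    return [[math.sqrt((1.0 if k == 0 else 2.0) / N)
             * math.cos(math.pi * (n + 0.5) * k / N) for k in range(N)]
            for n in range(N)]


def synth(Psi, alpha):                      # Psi @ alpha
    return [sum(p * a for p, a in zip(row, alpha)) for row in Psi]


def analyze(Psi, r):                        # Psi^T @ r
    return [sum(row[k] * rn for row, rn in zip(Psi, r))
            for k in range(len(Psi[0]))]


def hard_threshold(alpha, K):               # H_K
    order = sorted(range(len(alpha)), key=lambda i: (-abs(alpha[i]), i))
    keep = set(order[:K])
    return [a if i in keep else 0.0 for i, a in enumerate(alpha)]


def clip(x, tau):                           # C_tau, Eq. (4)
    return [math.copysign(min(abs(v), tau), v) for v in x]


def b_residual(x_c, x, tau):
    """B(x_c - x) of Eq. (7); the index sets are read off x_c and tau."""
    out = []
    for c, v in zip(x_c, x):
        u = c - v
        if c >= tau:
            u = max(u, 0.0)
        elif c <= -tau:
            u = min(u, 0.0)
        out.append(u)
    return out


def adaptive_iht_dc(x_c, Psi, tau, k_start=1, rel_tol=1e-6, max_iter=200, mu=1.0):
    """IHT-DC updates with a sparsity level K that grows by 1 per iteration."""
    D = len(Psi[0])
    alpha, K = [0.0] * D, k_start
    energy = sum(c * c for c in x_c)
    for _ in range(max_iter):
        g = analyze(Psi, b_residual(x_c, synth(Psi, alpha), tau))
        alpha = hard_threshold([a + mu * gi for a, gi in zip(alpha, g)], K)
        # Stop once the reclipped reconstruction matches the observations.
        reclipped = clip(synth(Psi, alpha), tau)
        residual = sum((c - r) ** 2 for c, r in zip(x_c, reclipped))
        if residual <= rel_tol * energy:
            break
        K = min(K + 1, D)
    return alpha, K


N, tau = 32, 0.2
Psi = dct_basis(N)
alpha_true = [0.0] * N
alpha_true[3], alpha_true[7] = 1.0, 0.6
x_c = clip(synth(Psi, alpha_true), tau)
alpha_hat, K_hat = adaptive_iht_dc(x_c, Psi, tau)
print(K_hat)
```

For a dictionary that is not orthonormal, such as a redundant DCT, the unit step would have to be reduced or replaced by the line search of (9).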

The SNR gain of this adaptive IHT-DC is then measured
by the difference between oSNR and iSNR; obviously, it
should be as high as possible. The values in Table 1 (top)
show the SNR gains for different clipping levels illustrating
severe (5 dB), moderate (10 dB) and mild (15 dB) signal
corruption, in comparison with other methods. Once again,
IHT-DC outperforms all others in this benchmark. Furthermore,
apart from the proposed algorithm, only the Adobe software is
somewhat successful in improving the clipped signal, whereas
the other methods usually degrade the signals even more.
Overall, IHT-DC was about 5 times faster than cOMP, with a
ratio between processing time and track duration close to 60
and 240 for moderate and severe clipping, respectively.

Table 1: (top) SNR gain (in dB) under clipping for the two audio tracks;
(bottom) listening evaluation (16 persons) for the two music samples.
Evaluation scores are between 1 ("very poor") and 5 ("excellent quality").

In addition to the previous SNR gain comparisons, we also
wish to assess how end-users perceive audio data processed by
the different algorithms. Therefore, we have designed a
randomized subjective test as follows. We have considered
again the 6 clipped audio contents generated by applying our
3 clipping scenarios to the 2 music tracks. Then, 5 versions
of each clipped track have been collected: the
reconstructions obtained with each of the 4 declipping
algorithms, plus the raw clipped version. A blind test has
then been run independently with 16 subjects. The 6 clipped
tracks have been considered in a random order (to prevent the
"training effect"). For each clipped track, the 5 versions
were evaluated by the listeners in a random order (to
preserve objectivity). The subjects were asked to score the
result between 1 (very poor) and 5 (excellent).

The results are presented in Table 1 (bottom). For the two
music signals, the proposed algorithm clearly outperforms the
competitors. Again, it seems that only Adobe DeClipper
(Adobe DC) provides some improvement in perceived quality,
while the cOMP and AR methods are in some cases graded lower
than the actual clipped signal. In the case of mild clipping,
users were in most cases unable to hear the difference
between audio clips.

7. CONCLUSION 

A new iterative method for declipping signals has been
presented. It extends the IHT algorithm by integrating a
clipping consistency during the iterations. Experimental
results on synthetic and actual audio data have demonstrated
the efficiency of this approach compared to other known
techniques. In the future, we plan to establish the
theoretical conditions guaranteeing the convergence of
IHT-DC. In particular, we will analyze how the "phase"
transition diagram obtained in Fig. 2 can be predicted
according to both the clipping and the signal sparsity
levels. From a practical perspective, an important gain could
also be obtained by exploiting different dictionaries. The
redundant DCT used for our tests is a very generic dictionary
that is not especially well suited to pure speech signals,
for instance.